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Abstract. We review some aspects of Bayesian and frequentist interval 
estimation, focusing first on their relative strengths and weaknesses 
when used in "clean" or "textbook" contexts. We then turn attention to 
observational-data situations which are "messy," where modeling that 
acknowledges the limitations of study design and data collection leads 
to nonidentifiability. We argue, via a series of examples, that Bayesian 
interval estimation is an attractive way to proceed in this context even 
for frequentists, because it can be supplied with a diagnostic in the form 
of a calibration-sensitivity simulation analysis. We illustrate the basis 
for this approach in a series of theoretical considerations, simulations 
and an application to a study of silica exposure and lung cancer. 

Key words and phrases: Bayesian analysis, bias, confounding, epi- 
demiology, hierarchical prior, identifiability, interval coverage, obser- 
vational studies. 



1. INTRODUCTION 

The conventional approach to observational-data 
analysis is to apply statistical methods that assume 
a designed experiment or survey has been conducted. 
In other words, they assume that all unmodeled 
sources of variation are randomized under the de- 
sign. In most settings, deviations of the reality from 
this ideal are dealt with informally in post-analysis 
discussion of study problems. Unfortunately, such 
informal discussion seldom appreciates the potential 
size and interaction of sources of bias and, as a con- 
sequence, the conventional approach encourages far 
too much certainty in inference (Eddy, Hasselblad 
and Schachter, 1992; Greenland, 2005, 2009; Green- 
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land and Lash, 2008; Molitor et al., 2009; Turner et 
al., 2009). 

The entrenchment of the conventional approach 
derives in part from the fact that realistic models 
for observational studies are not identified by the 
data, a fact which renders conventional methods and 
software useless (except perhaps as part of a larger 
fitting cycle). The most commonly proposed mode 
of addressing this problem is sensitivity analysis, 
which, however, leads to problems of dimensional- 
ity and summarization. The latter problems have in 
turn been addressed by Bayesian and related infor- 
mal simulation methods for examining nonidentified 
models (which are often dealt with under the topic 
of nonignorability) . These methods include hierar- 
chical (multilevel) modeling of biases (Greenland, 
2003, 2005), which is intertwined with the theme of 
the present paper. 

We start in Section 2 by reviewing some notions 
of interval estimator performance, with emphasis on 
coverage averaged over different parameter values. 
Section 3 then extends this discussion to include in- 
tervals arising from hierarchical Bayesian analysis 
when data from multiple studies are at hand. These 
two sections reframe existing theory and results in a 
manner suited for our present needs. We emphasize 
a well-known tradeoff: To the extent the selected 
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prior distribution is biased relative to reality, the 
coverage of a Bayesian posterior interval will be off, 
but perhaps not by much; and in return the inter- 
vals can deliver substantial gains in precision and re- 
duced false-discovery rates compared to frequentist 
confidence intervals. In addition, hierarchical priors 
provide a means to reduce prior misspecification as 
studies unfold. 

In Section 4 we turn to the more novel aspect 
of our work, by studying the case which we believe 
better captures observational-study reality, in which 
priors are essential for identification. Here the usual 
order of robustness of frequentist vs. Bayesian pro- 
cedures reverses: Confidence intervals become only 
extreme posterior intervals, obtained under degener- 
ate priors, with coverage that rapidly deteriorates as 
reality moves away from these point priors. In con- 
trast, the general Bayesian framework with proper 
priors offers some protection against catastrophic 
undercover age, with good coverage guaranteed un- 
der a spectrum of conditions specified by the inves- 
tigator and transparent to the consumer. Section 5 
summarizes the lessons we take away from our ob- 
servations and makes a recommendation concerning 
the practical assessment of interval estimator per- 
formance. We conclude that Bayesian interval es- 
timation is an attractive way to proceed even for 
frequentists, because its relevant calibration proper- 
ties can be checked in each application via simula- 
tion analysis. We close with an illustration of our 
proposed practical approach in an application to a 
study of silica exposure and lung cancer in which an 
unmeasured confounder (smoking) renders the tar- 
get parameter nonidentified. 

2. THE WELL-CALIBRATED LAB 

Let 9 denote the parameter vector, and D the ob- 
servable data, for a study that is to be carried out. 
Assume for now that the distribution of (D\9)(i.e., 
"the model") is known correctly. Say that <p = d(Q) 
is the scalar parameter of interest, and that 1(D) 
is an interval estimator for this target. We define 
the labwise coverage (LWC) of / with respect to a 
parameter- generating distribution (PGD) P as 

(l) C(l,P) = Vv{4>e 1(D)}. 

Here the probability is taken with respect to the 
distribution of (6,D) jointly, with 8 ~ P and (D\9) 
following the model distribution. 

Interval coverage with respect to a joint distribu- 
tion on parameters and data, as in (1), has been 



considered by many authors, but not with a consis- 
tent terminology. While it might be temping to refer 
to (1) as "Bayesian" coverage, we find this confus- 
ing since (1) can be evaluated for Bayesian or non- 
Bayesian interval estimators. We choose to call it 
labwise coverage since C(I,P) is the proportion of 
right answers reported by a lab or research team 
applying estimator / in a long series of studies of 
different phenomena (different exposure-disease re- 
lationships, say) within a research domain. The role 
of the PGD P is then to describe the corresponding 
across-phenomena variation in the underlying pa- 
rameter values. Interest in labwise coverage might 
be very direct in some contexts, in that estimator 
operating characteristics in a long sequence of ac- 
tual studies really are the primary consideration. Or 
interest may be more oblique, in that performance 
on the "next" study is of interest, and this perfor- 
mance is being measured conceptually by regarding 
the next study as a random draw from the popula- 
tion of "potential" or "future" studies. 

If / is a frequentist confidence interval (abbrevi- 
ated FCI), then it will attain nominal coverage ex- 
actly for any PGD. That is, if Pr{0 G I(D)\8} = 
1 — a for every value of 9, then C(I, P) = 1 — a for 
any P. Thus, correct coverage for a hypothetical se- 
quence of studies with the same parameter values 
implies correct coverage in the more realistic setting 
of repeatedly applying a procedure in a sequence 
of differing real problems. While this fact is often 
viewed as a robustness property of an FCI, Bayarri 
and Berger (2004), citing Neyman (1977), empha- 
size that it is the labwise coverage that is relevant 
for practice. Put another way, if a lab is well cali- 
brated in the LWC sense of producing 95% intervals 
that capture the true parameter for 95% of stud- 
ies, and the cost of failing to capture is the same 
across studies (as might be the case in some genome 
studies or screening projects), there is little obvious 
benefit if the intervals happen to also have correct 
frequentist coverage. 

2.1 Bayesian Intervals under PGDs 

For a given choice of prior distribution n on the 
parameter vector 9, a 1 — a Bayesian posterior credi- 
ble interval (BPCI) for the target parameter <f> would 
be any interval having Bayesian probability 1 — a of 
containing eft given the observed data D. The most 
common choices of BPCI are the equal-tailed BPCI 
(i.e, the interval formed by the a/2 and 1 — a/2 
posterior quantiles of the target parameter), and 
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the highest-posterior- density (HPD) BPCI. Though 
HPD intervals are optimally short, we consider only 
equal-tailed intervals here, given their simple inter- 
pretation and widespread use. 

If the prior II and the PGD P coincide, then a 
BPCI is guaranteed to have correct labwise cover- 
age. This strikes us as a fundamental property of 
BPCIs, though it is surprisingly unemphasized in 
most introductions to Bayesian techniques. Hence- 
forth, we refer to a BPCI arising from a prior dis- 
tribution set equal to the PGD as an omniscient 
or "oracular" BPCI (abbreviated OBPCI), in the 
sense that the investigator is omniscient in know- 
ing the actual PGD giving rise to future studies. It 
is indeed a fanciful assumption to think that the 
PGD would be known exactly, so throughout this 
paper we pay much attention to nonomniscient BP- 
CIs (abbreviated NBPCI). That is, we will evaluate 
lab-wise coverage when the investigator's prior dis- 
tribution II differs from the PGD P. 

It is worth noting that BPCIs have desirable prop- 
erties from a decision-theoretic point of view. The sit- 
uation is complicated in that both coverage and 
length must be reflected in the loss function. Hence, 
this function must be bivariate, or be a univariate 
combination of coverage and length terms (which 
would necessitate some weighting of the two) . Robert 
(1994) gives some general discussion of this point. 
Despite this complication, there are still results which 
link, and come close to equating, BPCIs and admis- 
sible interval estimators (see, for instance, Meeden 
and Vardeman, 1985). Thus, the common argument 
for Bayesian point estimators having desirable fre- 
quentist properties does extend, albeit with compli- 
cations, to the case of interval estimators. 

Additionally, there are large-sample results say- 
ing that in "regular" modeling situations with large 
sample sizes and priors with unrestricted support, 
BPCIs will have frequentist coverage that converges 
to nominal coverage, at every possible set of param- 
eter values. These results are based on obtaining 
a likelihood that dominates the prior given enough 
data; as such, they are not very useful for our pur- 
poses, because later we turn to problems in which 
no such domination occurs. We will however find use 
for a variant of this result in which information is 
accumulated over a sequence of studies. First, how- 
ever, we illustrate the operating characteristics of 
some interval estimators in a simple but relevant 
situation. 



2.2 Example: Mixture of Near-Null and 
Important Effects 

Say that 9 represents the strength of a putative 
exposure-disease relationship (which may indeed be 
one of a sequence of such exposure-disease combi- 
nations to be investigated). For instance, 9 might 
be a risk difference or a log odds-ratio relating bi- 
nary exposure and disease variables. Suppose that 
D is a univariate sufficient statistic such that D\9 ~ 
N(9,o~ 2 ) where a 2 is known. Then (D±q a /20~) can 
be reported as a 100 x (1 — a)% frequentist confi- 
dence interval (FCI) for 9, where q a /2 is the 1 — a/2 
standard normal quantile. 

In the context of observational epidemiology, null 
or minimal effects are common, and large effects 
are rare. Thus, the PGD giving rise to a sequence 
of studies might have most of its mass at or near 
zero. For instance, say the PGD is a mixture of two 
normal distributions: N(0,e 2 ) with weight p and 
iV(0, k 2 e 2 ) with weight 1 — p, for a "small" e and 
k > 1 . This is interpreted as the first component gen- 
erating minimal or near-null associations, while the 
second gives rise to important as well as near-null as- 
sociations, for example, \9i \ < 2e and \9i \ > ke might 
reasonably be described as near-null and important 
respectively. 

We simulate 500,000 parameter-data ensembles with 
£ = 0.05, p = 0.85, k = 8, and a 2 = 0.025. If 9 is a 
log odds-ratio, then these values have exp(#j) within 
(0.91, 1.1) as near-null, and exp(6>j) outside (0.67, 1.5) 
as important. The choice of a 2 = 2/((500)(0.2)(0.8)) 
approximates the amount of information for the log 
odds ratio when comparing two independent groups 
(as in an unmatched case-control study) with 500 
subjects per group and exposure prevalences around 
20%. 

The first two rows of Table 1 give operating char- 
acteristics of the FCI and the (equal-tailed) OBPCI 
as interval estimators for 9 (both at the nominal 95% 
level). Note that when used as the prior distribution 
for 9, the mixture distribution is conjugate, so that 
computation of the OBPCI is straightforward. As 
is consistent with theory, the labwise coverages of 
both procedures are within simulation error of the 
nominal 95%. On average, though, the OBPCI is 
considerably shorter than the FCI, by almost a fac- 
tor of two. This results from the infusion of prior 
information. 

Motivated by taking \9\ < 2e minimal effect, 
we also define the total discovery rate (TDR), false 
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Table 1 

Frequency properties of interval estimators based on 500,000 simulated parameter- data pairs from PGD with e = 0.05, 
p = 0.85, fc = 8. The "omniscient" posterior (OBPI) uses these parameter values; the "nonomniscient" posterior (NBPI) 

uses the values of p and k shown* 





Coverage % 


Avg. length 


TDR % 


FDR % 


FNR % 


FCI 


95.0 


0.62 


6.3 


17.0 


11.5 


OBPI 


95.0 


0.33 


2.2 


0.6 


14.1 


NBPI: 












p = 0.50, k = A 


95.8 


0.41 


2.1 


0.5 


14.2 


p = 0.50, fc = 12 


97.3 


0.46 


3.2 


2.5 


13.3 


p = 0.95, fc = 4 


89.8 


0.24 


0.9 


0.0 


15.2 


p = 0.95, fc = 12 


91.7 


0.26 


1.8 


0.2 


14.4 


N(0,u 2 ) 


94.8 


0.44 


2.1 


0.6 


14.1 


iV(0,0.5z/ 2 ) 


92.1 


0.36 


0.8 


0.0 


15.2 


N(0,2u 2 ) 


96.5 


0.51 


3.5 


3.4 


13.0 



'Simulation standard errors for coverage, TDR w 0.04%. The simulation standard errors for FDR are considerably larger and 
variable, since only a small portion (the TDR) of the simulated pairs contribute to the estimated proportion. 



discovery rate (FDR) and false nondiscovery rate 
(FNR) for interval estimation as follows: The TDR 
is simply the proportion of reported intervals that 
exclude the minimal range, that is, give confidence 
that the effect is not minimal. The FDR is then the 
proportion of these "discoveries" that are false, that 
is, in which the parameter actually does lie in the 
minimal range. Similarly, amongs intervals intersect- 
ing the minimal range, the FNR is the proportion 
for which the target is actually outside this range. 
We can then describe how the OBPCI is more con- 
servative than the FCI: The OBPCI attains a lower 
FDR at the cost of a higher FNR, as evidenced in 
the first two rows of Table 1. 

Investigators are not omniscient. To illustrate con- 
sequences of defective prior information, we examine 
results in which the prior distribution deviates from 
the PGD. Our example is far from a comprehensive 
study of prior misspecification, and we doubt that 
such a study could be done given all the contex- 
tual elements involved. Rather, we wish to illustrate 
some qualitative points that will be relevant later, 
regarding potential consequences of such misspecifi- 
cation. 

Two sets of NBPCI results are given in Table 1. 
The first set corresponds to an investigator using the 
same form of a mixture-normal prior with the cor- 
rect value of e (which defines the notion of a minimal 
effect and so is contextually established), but with 
misspecified values of p and k (choosing p = 0.50 
or p = 0.95, and k = 4 or k = 12). The second set 
corresponds to an investigator who does not eluci- 
date a mixture structure for the prior, but rather 



simply applies a mean-zero normal prior. The case 
where the prior variance r 2 equals the PGD vari- 
ance of v 2 = pe 2 + (1 — p)k 2 e 2 is considered, as are 
the cases where the prior variance is half/double the 
PGD variance. The results in Table 1 underscore 
the disadvantage of the NBPCI relative to the FCI 
and the unattainable OBPCI: The labwise cover- 
age now deviates from nominal. Arguably, however, 
these deviations are modest. Moreover, the NBPCI 
tend to maintain the other attractive features seen 
with the OBPCI, namely, the much shorter average 
length and lower FDR compared to the FCI. Note 
that the deviations from nominal coverage are less 
pronounced and tend toward conservatism when the 
prior is more spread out than the PGD {p = 0.50 
in the first set of results, r 2 = 2v 2 in the second). 
This is not surprising, since NBPCIs will resemble 
FCIs more and more in the "flat" prior limit. In con- 
trast, the deviations can be markedly anticonserva- 
tive when the prior is more concentrated (p = 0.95 
in the first set, r 2 = 0.5z/ 2 in the second). Thus, by 
using very dispersed priors, we can improve the pre- 
cision and reduce the FDR of our intervals with- 
out incurring objectionable deviations from nominal 
coverage. If, however, we "get greedy" and attempt 
to improve performance by using overconfident pri- 
ors, we risk unacceptable deterioration of coverage. 

In this and subsequent examples, we have used 
equal-tailed BPCIs because these are intuitive and 
commonly reported. It is well known, however, that 
for a given data set the HPD interval (or possibly 
region) is the shortest interval with the specified 
Bayesian probability content. In fact, Uno, Tian and 
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Wei (2005) prove an interesting result about labwise 
coverage of HPD intervals, in the case that the prior 
and PGD coincide. They show that the HPD inter- 
val with coverage 1 — a does not always minimize 
average width subject to obtaining labwise coverage 
1 — a. Rather, the minimizing procedure in general 
involves the HPD interval with coverage 1 — a(D), 
such that E{a(D)} = a, where the data-dependent 
coverage level a(D) arises by thresholding the poste- 
rior densities for all studies at the same cutoff value. 
Thus, in cases where the width of the posterior den- 
sity varies across studies, HPD intervals with higher 
(lower) coverage levels will be reported for studies 
with narrower (wider) posterior densities. While we 
do not pursue this further here, it is worth empha- 
sizing that the simple and intuitive interpretations 
associated with using a BPCI of fixed coverage level 
can be sacrificed in order to obtain intervals which 
are narrower on average. 

Given the performance issues illustrated in Table 
1, we find it overly simplistic to argue against the use 
of Bayesian interval estimation simply because prior 
specification is required, and because it is impossible 
to get this specification exactly right in terms of 
matching the PGD. Furthermore, as we discuss in 
the next section, by using a hierarchically structured 
prior, one can effectively move the prior closer to the 
PGD as a sequence of studies unfolds. 

3. HIERARCHICAL PRIOR DISTRIBUTIONS 

As we have emphasized, consideration of a se- 
quence of studies is a conceptual device capturing 
the reality facing most investigators. Studies in 
medicine and public health are nowhere near iden- 
tical in design, conduct and population studied and, 
hence, there is no basis for asserting parameter equal- 
ity across these studies. This fact is the rationale 
for random-effects models in meta-analysis, which 
typically employ very simple models for the PGD. 
When a new study is performed, however, data from 
m previous studies can be used to improve the prior 
distribution for 9 by using a hierarchically struc- 
tured prior distribution, which in turn will make the 
labwise coverage closer to nominal (as m increases, 
particularly). This further strengthens the frequen- 
tist appeal of Bayesian intervals. 

Say that the study to be carried out has parameter- 
data ensemble (D, 9) and is preceded by m earlier 
studies with ensembles (D\,Q\), . . . , (D^, 9^). If the 
m + 1 ensembles are independent and identically dis- 
tributed (i.e., each according to the PGD and the 



data model) , it makes sense to allow the interval es- 
timator for 9 to depend on the earlier data as well 
as the current data. The labwise coverage (1) then 
generalizes to 

(2) C(I,P) = Pt{^>eI(D;D*)}, 

where D* = (£>*, . . . , D^) and the probability is taken 
with respect to the joint distribution of the m + 1 
parameter-data ensembles. 

The standard Bayesian approach to borrowing 
strength across studies involves a hierarchical prior. 
That is, the prior n asserts that the m + 1 compo- 
nents of (9*, 9) are independent and identically dis- 
tributed given a further parameter vector A. Then A 
itself is assigned a prior distribution. Application of 
Bayes theorem to form the posterior distribution on 
9 involves the likelihood contribution of D\9, with 
U(9\D*) = JTl(0|A)n(A|Z)*)dA playing the role of 
the prior. That is, the earlier studies inform the 
value of A, which in turn informs 9, in advance of ob- 
serving D. If the PGD is well approximated by the 
posited (9\X) prior for some value of A (say, Ao), and 
if the number of previous studies m is large, then 
n(A|D*) should be concentrated near Ao- Thus, the 
"effective prior" being applied to 9 will be close to 
the PGD, which should result in labwise coverage 
for BPCIs that is close to nominal. 

The use of hierarchical priors to "borrow strength" 
across studies and the evaluation of coverage along 
the lines of (2) originates under the rubric of "em- 
pirical Bayes" procedures (see, for instance, Morris, 
1983), which typically involve a non-Bayesian ap- 
proximation to n(A|.D*). With the advent of bet- 
ter algorithms and machines for Bayesian compu- 
tation, however, fully Bayesian "hierarchical model- 
ing" is now commonplace. It should also be noted 
that treating the parameter values for the m+1 
studies as exchangeable as described is a modeling 
assumption that will sometimes be inappropriate. 
Notably, in a situation where all studies focus on the 
same relationship at different calendar times, the as- 
sumption may be dubious but may be weakened to 
allow for trends. For instance, the "Ty Cobb" ex- 
ample in Morris (1983) involves explicit modeling 
of a time trend for parameter values corresponding 
to consecutive calendar years. Analogously, the as- 
sumption may be weakened to allow for group ef- 
fects; for example, the occupational-cancer exam- 
ple in Greenland (1997) explicitly models changes 
in association over cancer type and exposure type. 
In such examples it is a residual component of the 
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study parameters (after time or group effects are re- 
gressed out) that is assumed exchangeable. The ex- 
changeability inherent in assuming the original study 
parameters are conditionally i.i.d. would be appro- 
priate only if no important information is conveyed 
by the time order or other known and varying char- 
acteristic of the parameters being modeled as an en- 
semble. 

To give a simple illustration, suppose again that 
D\9 ~ N(9,a 2 ) with a 2 known. For computational 
ease we first consider a simpler PGD than previ- 
ously, namely, a N(Xq,t 2 ) distribution. Consider a 
partially omniscient investigator who knows the vari- 
ance of the PGD, but not the mean, and in hierar- 
chical fashion assigns the prior 9\X ~ N(X,t 2 ), X ~ 
N(5,lo 2 ). The marginal prior distribution is then 
9 ~ N(5, t 2 +uj 2 ). In the absence of previous studies, 
or with an i.i.d. prior assuming independence of 9 
and 9* , the posterior on 9 would arise from combin- 
ing this prior with D, and the discrepancy between 
this prior and the PGD would induce some degree 
of non- nominal labwise coverage. With a correct hi- 
erarchical prior, however, the previous data will pull 
IT(A|L>*) toward Ao, and hence pull H(9\D*) toward 
the PGD. 

The present setting is sufficiently simple that the 
LWC given in (2) can be computed directly (one- 
dimensional numerical integration is required, but 
repeated simulation of data is not). As an exam- 
ple, suppose a = 1, and the PGD has Ao = 3, r = 1. 
Consider four prior distributions for A, with uj = 1, 
and 5 = 0, 1,2,3. Note that these priors range from 
very bad (mean of the PGD lies three prior stan- 
dard deviations away from the prior mean) to un- 
realistically good (mean of the PGD coincides with 
the prior mean). Figure 1 illustrates the coverages 




10 20 30 5 10 15 20 25 30 

m m 



Fig. 1. Labwise coverage and interval length for the nominal 
95% BPCI, as a function of the number of previous studies 
m. Coverage is given for four choices of hyperparameter 8, 
whereas the length does not depend on 8. The dashed horizon- 
tal line in the second panel corresponds to the length of the 
OBCPI. 



(2) for the resulting 95% BPCIs, as the number of 
previous studies m increases. When m = (thought 
of as either no previous studies, or as an i.i.d. prior 
across studies), the coverage ranges from less than 
80% for the "worst" prior to somewhat above 95% 
for the "best" prior. As Figure 1 illustrates, how- 
ever, for all the priors the coverage converges quite 
quickly to the nominal 95% as m increases, to match 
the OBPCI coverage. The figure also displays the in- 
terval width as a function of m. In this simple setting 
the width is governed by 

Var(0|D,£>*) 

^2 2/2 , 2\-l 

= a t [a + t ) 

(3) 

x [l + a 2 u 2 T~ 2 {(m + l)u} 2 

+ a 2 +r 2 Y\ 

which depends on neither the observed data nor the 
hyperparameter 5. Note that by m = 30 most of the 
potential reduction in width has been realized, that 
is, the width is close to the OBPCI width which 
corresponds to the m — > oo limit of (3). While the 
convergence of coverage and length to match the 
OBPCI represents a well-known calibration feature 
of Bayes and empirical-Bayes estimation, it is inter- 
esting to see how rapidly it can proceed in simple 
settings. 

Of course, the above illustration is very simplistic, 
particularly as the variance components involved in 
the prior are taken to be known, and only the mean 
of the PGD must be learned via previous data. At 
the other end of the spectrum, one anticipates that 
complex PGDs, such as those involving a mixture of 
near-null and important effects, will require a larger 
number of studies before they are estimated well. 
To investigate this, we reconsider the example of 
the previous section, involving a mixture of near- 
null and important effects. Now, however, we treat 
(p, k) as unknown parameters with prior distribu- 
tions p ~ Uniform(0, 1) and k ~ Uniform(4, 20). As 
before, the PGD is based onp = 0.85 and k = 8. Em- 
pirical results on coverage and average length appear 
in Table 2. These are based on only 1000 simulated 
parameter-data meta-ensembles (with each meta- 
ensemble encompassing m + 1 parameter-data en- 
sembles), since the posterior computation is burden- 
some. In particular, a simple Markov chain Monte 
Carlo algorithm (with random walk proposals) is ap- 
plied to sample from (p, k\D, D*), while (9\p, k,D, 
D*) = (9\p,k,D) can be sampled from directly. Re- 
sults for coverage are quite appealing, in that the 
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labwise coverage (2) modestly exceeds nominal when 
the number of previous studies m is small, presum- 
ably because Tl{9\D*) is very flat, and also modestly 
exceeds nominal when m is large, presumably be- 
cause H(8\D*) is close to the PGD. However, the 
very slow convergence of U(6\D*) to the PGD is 
manifested by the average interval width. Even with 
m = 100 previous studies, the average width is still 
44% larger than that of the OBCI. Nonetheless, it is 
23% narrower than the FCI, a worthwhile gain paid 
for by a minor conservatism. 

There are many examples of hierarchical modeling 
in the literature, where unknown means and vari- 
ances are themselves modeled via prior distributions 
or estimated via marginal likelihood. These meth- 
ods have performed quite well in large-scale simu- 
lations and in applications that provide subsequent 
validation (Brown, 2008). Special methodology for 
inference about the distribution of a sequence of ef- 
fects has expanded apace, driven by work on mul- 
tiple comparisons (and particularly false discovery 
rates) in genome studies (see, for instance, Efron et. 
al., 2001 and Newton and Kendziorski, 2003). 

We have emphasized that formal inclusion of pre- 
vious studies on various phenomena within a re- 
search team's domain of study can have positive 
benefits for subsequent studies within this domain, 
in terms of both labwise coverage and average width. 
Consequently, a formal scheme to obtain a prior 
which is close to the PGD for a given domain seems 
desirable when practical. In other circumstances, how- 
ever, it should be possible to informally use previ- 
ous studies in constructing a reasonable prior distri- 
bution. As alluded to earlier, for instance, in many 
sub-fields of epidemiology investigators do have well- 
grounded notions concerning the across-study preva- 
lence of near-null effects and magnitude of impor- 
tant effects. One anticipates that a prior formed 
from direct elicitation of the investigators' views 
should not deviate greatly from a prior formed from 

Table 2 

Empirical properties of interval estimators with a prior 
distribution on {jp, k) and m previous studies. Results based 
on 1000 simulated parameter- data meta- ensembles 



m 


Coverage % 


Avg. length 





96.4 


0.515 


10 


95.9 


0.491 


20 


95.8 


0.487 


100 


95.4 


0.476 



formally updating a "flat" prior based on previous 
studies. Regardless of which route is taken, construc- 
tion of a prior which is reasonably close to the PGD 
for future studies in the domain seems to be a realis- 
tic and worthwhile goal. With this encouraging mes- 
sage in hand, we now turn to examining the use of 
Bayesian interval estimators in nonidentified model 
settings. 

4. INTERVAL ESTIMATION IN 
NONIDENTIFIED MODELS 

The case for BPCIs versus FCIs seems mixed thus 
far, particularly as FCIs are guaranteed to have cor- 
rect labwise coverage, without requiring any knowl- 
edge of the PGD. But in a large class of statisti- 
cal problems, construction of valid FCIs is not pos- 
sible. Recall that in general a model is noniden- 
tified if there are multiple sets of parameter val- 
ues giving rise to the same distribution of observ- 
ables. We have argued that this class of models is 
the only realistic choice in most observational stud- 
ies of human health and society (Greenland, 2005; 
Gustafson, 2006). This is particularly true in disci- 
plines such as epidemiology where honest appraisal 
of what modeling assumptions are justified, and what 
limitations are inherent in the available data, ought 
to lead investigators to nonidentified models rou- 
tinely. 

Identifiable models are desirable when they can 
supply root-n consistent estimators of target pa- 
rameters, as in classic industrial and laboratory ex- 
periments. With study problems such as measure- 
ment error, missing data, selection bias and unmea- 
sured confounders, however, extremely strong as- 
sumptions may be required to attain an identified 
model. Most statistical methods assume absence of 
such problems, and the remainder assume that the 
form of the problems is known up to a few identifi- 
able parameters. Either way, there is a strong pos- 
sibility that the resulting model is grossly misspec- 
ified, with the resulting FCIs exhibiting excessive 
precision and severe undercoverage for the inferen- 
tial target. 

Put another way, using an overly simplified model 
for the sake of identifiability results in root-n consis- 
tent inference for the wrong parameter (e.g., an un- 
conditional association, when the desired inferential 
target is an association conditional on an unmea- 
sured covariate) (Greenland, 2003, 2005; Gustafson, 
2006). If, as usual, the parameter being estimated 
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does not equal the target parameter, the interval 
coverage for the latter will tend to zero as the sam- 
ple size increases. 

Backing away from untenable assumptions may 
result in a model that is better specified (closer to 
reality, or at least better representing the true infer- 
ential target), but which lacks identifiability. There 
is extreme hesitance among statisticians regarding 
the use of nonidentified models, because they do not 
give rise to estimators with familiar statistical prop- 
erties, such as root-n shrinkage of interval estima- 
tors to some value. But for Bayesian analysis there 
is no conceptual or computational difference in how 
inferences are obtained from a nonidentified model 
compared to an identified model. In fact, from a 
radical subjective Bayesian perspective, identifica- 
tion is a matter of a degree and always a function of 
the full prior (including the prior for the data given 
the parameters). 

In summary, in nonidentified problems there is no 
route to FCIs achieving exactly nominal coverage for 
any set of underlying parameter values. If in these 
settings we simplify the model to the point of identi- 
fiability, then FCIs are readily obtained via standard 
methods, but are likely to have grossly incorrect cov- 
erage probabilities due to misspecification. Without 
simplification, models are nonidentified, which pre- 
cludes construction of FCIs having the nominal cov- 
erage probability at every point in the parameter 
space. 

Some frequentist approaches to problems of this 
sort involve (i) specifying bounds (rather than prior 
distributions) on key parameters, and (ii) construct- 
ing interval estimators having at least nominal cov- 
erage at every point in the parameter space, with the 
consequence that the coverage will be higher than 
nominal at most parameter values. Some recent sug- 
gestions along these lines include Imbens and Man- 
ski (2004), Vansteelandt et al. (2006) and Zhang 
(2009); we illustrate such an approach in the first 
of the two examples below. Conversely, the use of 
Bayesian or approximately Bayesian inferences from 
nonidentified models was suggested at least as far 
back as Learner (1974), and has long been discussed 
under special topics such as nonignorable missing- 
ness (Little and Rubin, 2002). It has also attracted 
considerable attention in recent literature; see, for 
instance, Dendukuri and Joseph (2001); Greenland 
(2003, 2005); Gustafson (2005b); Gustafson and 
Greenland (2006a, 2006b); Hanson, Johnson and 
Gardner (2003); Joseph, Gyorkos and Coupal (1995); 



McCandless, Gustafson and Levy (2007, 2008); Scharf- 
stein, Daniels and Robins (2003). 

For Bayesian procedures, the exact attainment of 
nominal labwise coverage by an OBPCI still holds 
under nonidentified models. The result in general 
(for any kind of model) is known, but surprisingly 
unemphasized in the literature (see Rubin, 1984, 
and Rubin and Schenker, 1986, for exceptions). Yet 
it seems to be a useful reference point, as it provides 
a clear calibration, or "anchor," for an interval esti- 
mation procedure in a nonidentified model. On the 
other hand, we generally expect the choice of prior to 
be far more influential on the posterior distribution 
when the model is nonidentified, so that lab-wise 
coverage may deviate rapidly from nominal as the 
prior distribution deviates from the PGD. We inves- 
tigate this phenomenon in the two examples below. 

4.1 Example: Prevalence Survey 
with Nonresponse 

Vansteelandt et al. (2006) illustrate some frequen- 
tist techniques for sensitivity analysis in nonidenti- 
fied models in the following setting. A binary out- 
come Y may be observed (R = 1) or missing (R = 0, 
nonresponse) for each study unit, so that the avail- 
able data consist of n i.i.d. realizations of (RY,R). 
The inferential target is the outcome prevalence, 
-/r = Pr(y = 1), while the missingness may be infor- 
mative, that is, Y and R may be associated. One 
parameterization for this situation is p = Pr(R = 1), 
s = Pr(Y = 1\R = 1), and 7 = logit{Pr(Y = 1\R = 
0)} — 9 where 9 = logit(s). Then the inferential tar- 
get is 7r = (1 — p) expit(0 + 7) + ps. This is a non- 
identified inference problem because the likelihood 
for the observed data depends only on p and s, while 
the inferential target also depends on 7. 

We consider the coverage and average length of 
three interval estimators for ir. The first is the naive 
interval estimator obtained by assuming 7 = 0, that 
is, assuming the missingness is completely at ran- 
dom, and estimating ir as the sample proportion of 
the observed outcomes. The second is an interval 
estimator suggested by Vansteelandt et al. (2006), 
designed to have at least nominal frequentist cov- 
erage (approximately) under every fixed value of 7 
in a specified interval /; we take I = (—2, 2) in the 
present example. Let ir/ and ir u be the estimates 
of ir when fixing the value of 7 at the lower and 
upper endpoints of I respectively. Then the inter- 
val estimator with target level 1 — a is of the form 
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{n -Q a */2se(7Ti),7i- u + q a ,/ 2 se(TT u )), where a* is cho- 
sen to make the minimum coverage as 7 varies in 
I equal to 1 — a (with the minimum attained at 
one of the endpoints) . We refer to this interval as a 
conservative frequentist confidence interval (CFCI). 
The relationship between a* and a depends on the 
unknown parameters, hence, estimates are plugged 
in and the coverage properties become approximate 
rather than exact. Vansteelandt et al. (2006) call in- 
terval estimators of this form "pointwise estimated 
uncertainty regions," since the coverage claim ap- 
plies to the true value of the target parameter. These 
authors also propose "weak" and "strong" estima- 
tors with coverage claims pertaining to the set of 
all target parameter values consistent with the ob- 
served data law (i.e., interval estimation of an inter- 
val). For more details see Vansteelandt et al. (2006). 

The third interval estimator is the equal-tailed 
Bayesian credible interval arising from a uniform 
prior distribution for 7 on the same interval /, along 
with uniform(0, 1) priors for both p and s. Under 
this specification the parameters p, s and 7 remain 
independent of one another a posteriori, with beta 
distributions for p and s arising from binomial up- 
dating, and a uniform posterior distribution on I for 
7; that is, no updating of 7 occurs. 

Empirical labwise coverage and average length for 
nominal 95% intervals are given in Table 3. The PGDs 
used have normal distributions for ft = logit (p) and 
9 = logit(s) with up = logit 0.67, 073 = (logit 0.89 - 
logit 0.67) /2, n e = logit 0.5 and a e = (logit 0.8 - 
logit 0.5)/2. Thus, the PGD for (p,s) concentrates 
around more typical-use scenarios than does the prior 
for these parameters. The PGD is completed by 
7 ~ uniform ( J) for various specifications of interval 
J. Note that one specification is the single-point in- 
terval J = [2, 2] , which corresponds to fixing 7 at the 
endpoint of I, and hence corresponds to a partially 
frequentist evaluation of coverage. Note also that the 
average interval lengths do not depend on the speci- 
fication of J for this problem, since the distribution 
of the observed data (under the joint distribution of 
parameters and data) does not depend on J. Thus, 
the average lengths of 0.11 for the naive interval, 
0.33 for the CFCI, and 0.28 for the BPCI apply for 
any J. 

Table 3 verifies that when J in the PGD and / in 
the prior coincide, the Bayesian intervals have LWC 
within simulation error of nominal, despite the dis- 
crepancy between the uniform priors for (p, s) and 
the logit-normal PGDs for (p,s). In contrast, the 



Table 3 

Empirical coverage probabilities and average lengths for 
nominal 95% interval estimators of a prevalence tt. Results 
are given for naive estimator, the CFCI and the BPI. For 
each choice of PGD (i.e., choice of interval J), results are 
based on 10,000 simulated parameter- data ensembles with a 
sample size of n = 500. Simulation standard errors for 
coverages are 0.5% or less. Both the CFCI and the BPI 
assume an interval range I — ( — 2, 2) for 7 



J in PGD: 


Naive 


CFCI 


Bayes 


J = (-2, 2) 


42% 


99% 


95% 


J =(-3,3) 


30% 


93% 


80% 


J =(-1,1) 


67% 


100% 


100% 


J =(-1,3) 


40% 


95% 


84% 


J = (2,2) 


9% 


95% 


71% 


Average length 


0.11 


0.33 


0.28 



CFCI approach is indeed quite conservative when / 
and J coincide, with labwise coverage of 99% and 
average length 17% greater than the BPCI. As ex- 
pected, the labwise coverage of both the CFCI and 
the BPCI is highly affected by any discrepancy be- 
tween / and J. As advertised, the CFCI achieves 
conservative coverage in all cases, except for a slight 
dip below nominal in the case that J is wider than 
(and contains) /. Note, in particular, that the CFCI 
achieves nominal coverage when 7 is fixed at an end- 
point of I, whereas the BPCI coverage drops to 71% 
in this setting. 

The differences between labwise coverage of BP- 
CIs and CFCIs are somewhat hidden in Table 3, 
since nominal 95% intervals do not have much "room' 
to obtain higher than nominal coverage. Thus, we 
also report results for nominal 80% intervals (Table 
4). Admittedly, such intervals are seldom reported 
in practice (though see Greenland et al., 2000, for 
an exception), but they are useful for gauging the 
extent to which a given interval estimator is con- 
servative. The average lengths of these intervals are 
0.069 (naive), 0.29 (CFCI) and 0.22 (BPCI). When 
I and J match, we now see very substantial over- 
coverage (96%) for the CFCI, with an average width 
30% greater than for the OBPCI. We also see more 
clearly the over-coverage that results for both CFCI 
and BPCI when J is narrower than I. 

The BPCI and the CFCI are constructed to sat- 
isfy different criteria, and we are not attempting to 
argue than one is better than the other. In particu- 
lar, note the tradeoff exhibited in Tables 3 and 4. If 
the investigator has an interval of values / in mind 
for 7, then the CFCI has a conservatism which may 
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Table 4 

Empirical coverage probabilities and average lengths for 
nominal 80% interval estimators of a prevalence n. Both the 
CFCI and the BPI assume an interval range I = ( — 2, 2) for 
7. The table entries are as per Table 2 



J in PGD: 


Naive 


CFCI 


Bayes 


(-2,2) 


27% 


96% 


80% 


(-3,3) 


19% 


83% 


59% 


(-1,1) 


47% 


100% 


98% 


(-1,3) 


26% 


87% 


69% 


(2,2) 


4% 


80% 


31% 


Average length 


0.069 


0.29 


0.22 



be appealing: at least nominal coverage can be ob- 
tained with respect to any averaging across values 
in /, including the selection of single points. On the 
other-hand, if labwise coverage with respect to the 
Uniform(J) distribution is at issue, then the BPCI 
will be shorter on average, and have correct cov- 
erage. We do emphasize that this correct coverage 
constitutes a calibration property of the BPCI which 
the CFCI does not possess. That is, without doing 
simulation, we do not know to what extent the CFCI 
based on interval / will exhibit higher than nomi- 
nal labwise coverage when the PGD is based on /. 
But we do know automatically that the BPCI using 
/ in the prior will exhibit correct labwise coverage 
when the PGD is based on /. Thus, the BPCI is an- 
chored via the investigator's knowledge that exactly 
nominal coverage would be obtained in a sequence 
of studies with PGD equal to the prior, and presum- 
ably at least nominal coverage would eventually be 
attained in a sequence of studies in which the sup- 
port of the prior contains the PGD. In this sense, 
posterior coverage is conservative precisely when the 
prior is conservative relative to the PGD. The CFCI 
labwise coverage has a more murky connection to 
the PGD, which is the price it pays for obtaining 
correct frequentist coverage at the endpoints of the 
prior interval I. 

4.2 Example: Case-Control Study 
with Misclassification 

Consider an unmatched case-control study of the 
association of a disease indicator Z and a binary ex- 
posure indicator X, with X subject to independent 
nondifferential misclassification. Let tq = 
Pr(A~ = 1\Z = 0) and r x = Pr(X = l\Z = 1) be the 
prevalences of actual exposure among nondiseased 
and diseased source population members, and let 



SN = Prpf* = l\X = 1) and SP = Vr(X* = Q\X = 
0) be the sensitivity and specificity of the expo- 
sure classification in the study. The numbers ap- 
parently exposed among the no nondiseased con- 
trols and n\ diseased cases in the study are mod- 
eled as Yi ~ Bin(nj,#j) for i = and i = 1 respec- 
tively, with 9i = nSN + (1 - n)(l - SP) = Pr(X* = 
l\Z = i). If all four parameters (ro, n, SN, SP) are 
unknown, then this model is not identified by the ob- 
served counts (yi,yo,ni — yi,no — yo). Bayesian in- 
ference under this model is considered by Gustafson, 
Le and Saskin (2001), Gustafson (2003), Greenland 
(2005), Chu et al. (2006) and Gustafson and Green- 
land (2006a), among others. 

We consider prior distributions and PGDs of the 
following form: A bivariate normal distribution for 
the logit prevalences (logit ro, logit ri), with correla- 
tion p and identical marginals (mean ft and variance 
t 2 ). The log-odds ratio, /? = logit(ri) — logit(ro), is 
then distributed as iV{0, (1 — p)2r 2 }. The correla- 
tion is essential to reflect the fact that information 
about the exposure prevalence in one group would 
alter bets about the prevalence in the other group, 
due to prior information about (3 (Greenland, 2001). 
57V and SP are here taken as independent of the ex- 
posure prevalences and each other, however, with 
SN ~ Beta(aAr, 6jy) and SP ~ Beta(ap, bp); more 
realistic priors might allow dependent 57V and SP 
(Chu et al., 2006; Greenland and Lash, 2008), or one 
could instead reparameterize the problem to make 
prior independence reasonable (Greenland, 2009). 

Bayesian computation is readily implemented via 
the efficient algorithm of Gustafson, Le and Saskin 
(2001). While this algorithm takes advantage of struc- 
ture imbued by assigning uniform priors on preva- 
lences, we can use importance sampling to adapt the 
algorithm output to the present prior specification. 
As an example, m = 10,000 parameter-data ensem- 
bles with n\ = ri2 = 500 are drawn from the PGD 
based on p = —2.3, r = 1.17, p = 0.76, ajy = ap = 
18, a at = ap = 4. These choices produce a 95% logit- 
symmetric interval for each of (0.01,0.50) and a 
95% log-symmetric interval for e^ of (0.2,5.0). Also, 
the modes of the SN and SP distributions are 0.85, 
with 95% logit-symmetric intervals of (0.637,0.946). 

For each data set, seven interval estimates for /3 
are constructed: 

(i) the standard FCI assuming no misclassifica- 
tion; 

(ii) an FCI derived by taking 57V = 0.85 and 
SP = 0.85 as known values; 
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(iii) the omniscient BPCI arising when the prior 
distribution coincides with the PGD; 

Nonomniscient BPCIs with priors based on cor- 
rect specification of (/x,t, p) but: 

(iv) a 7v = ap = 9.5, bjy = bp = 2.5 (keeping the 
prior modes on SN and SP at 0.85 but making the 
distribution more diffuse); 

(v) oat = ap = 26.5, b]y = bp = 5.5 (modes at 
0.85 but overly concentrated); 

(vi) clm = ap = 23.5, b^ = bp = 8.5 (still overly 
concentrated and modes shifted down to 0.75); 

(vii) otv = ap = 28.5, 6yv = bp = 3.5 (still overly 
concentrated and modes shifted up to 0.95). 

Empirical properties of the interval estimators (at 
the nominal 95% level) are described in Table 5. 

In the previous example, the joint posterior den- 
sity was a product of the marginal posterior den- 
sity for the two parameters appearing in the like- 
lihood function and the marginal posterior density 
(equal to the prior density) for the one parameter 
not in the likelihood. This factorization simplified 
the mathematics of how the prior influences the pos- 
terior distribution of the target parameter. In the 
present example, however, the structure of the prob- 
lem is more nuanced. As emphasized by Gustafson, 
Le and Saskin (2001), the support of the two pa- 
rameters not in the likelihood, (SN,SP), depends 
on the values of the two parameters in the likeli- 
hood, (^o^i); since by construction 1 — SP and SN 
must straddle both 9i values. To some extent then, 
the posterior distribution of (SN,SP) can depend 
on the data, even though these parameters do not 
appear in the likelihood function. Gustafson (2005a) 

Table 5 

Empirical labwise properties of nominal 95% interval 
estimators for a log odds ratio /3 based on 10,000 simulated 
parameter- data ensembles. The simulation standard errors 
for coverage are less than 0.5%. Results for estimator (ii) 
are based only on the 81 % of ensembles for which the method 

works 



Coverage Avg. length 



(i) -FCI 44% 0.60 

(ii) -FCI 81% 2.20 

(iii) -OBPI 95% 2.02 

(iv) -NBPI 95% 2.12 

(v) -NBPI 94% 1.94 

(vi) -NBPI 95% 2.32 

(vii) -NBPI 87% 1.56 



discusses such indirect learning about parameters in 
nonidentified models in more general terms. 

Given that the data can provide some informa- 
tion about (SN,SP), one might anticipate that the 
NBPCI coverage would be less sensitive to the choice 
of prior than in a situation without any indirect 
learning. The results in Table 5 bear this out, with 
the coverage of nominal 95% NBPIs ranging from 
87% to 95% across the priors considered. In accord 
with theory, the OBPCI coverage is within simula- 
tion error of nominal, which can be regarded as a 
check that our scheme for posterior computation is 
working adequately (see Cook, Gelman and Rubin, 
2006, for elaboration). 

While the link between (SN,SP) and (0 o ,0i) is 
exploited to advantage under a Bayesian analysis, 
it is problematic for the FCI based on taking SN 
and SP as fixed values less than one. In particu- 
lar, the FCI is not defined for data sets with one 
or both 9i falling outside the interval (1 — SP, SN). 
Moreover, this can happen via sampling variation 
even if the postulated values of (SN,SP) happen 
to be correct. Tu, Litvak and Pagano (1994, 1995) 
discussed this problem, and offered some mitigating 
strategies when exposure prevalence (say, in a sin- 
gle population) is the inferential target of interest. 
Such strategies yield interval estimates for preva- 
lence with an endpoint at zero or one, which lim- 
its their utility for odds-ratio inference. In our case, 
the results for estimator (ii) in Table 5 are based 
on only the 81% of sampled parameter-data ensem- 
bles not giving rise to the aforementioned problem. 
Perversely, this method is failing in situations where 
the data are most suggestive that the guessed val- 
ues of (SN,SP) might be wrong. Put another way, 
the FCI fails on data sets where Bayesian intervals 
may do particularly well via more prior-to-posterior 
updating of (SN,SP). 

As a final point concerning this example, we rec- 
ognize that it is quite reasonable to also study the 
frequentist properties of the Bayesian interval esti- 
mator. This can become quite computationally bur- 
densome, however, if evaluation of frequentist cover- 
age at many points in the parameter space is desired: 
each point necessitates simulation of many data sets, 
and each data set may require many MCMC iter- 
ations in order to compute the interval estimate. 
Rather than pursuing this course, we note that the 
simulation of parameter-data ensembles as used to 
evaluate labwise coverage also yields information 
about frequentist coverage. 
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Thus, say that the frequentist coverage for param- 
eter vector 9* is of interest. If m parameter-data en- 
sembles are simulated, then we might consider the 
proportion a of ensembles for which 9 is closest to 
9* in some sense. Then the empirical coverage for 
this subset of ensembles approximates the frequen- 
tist coverage at 9*, with the approximation improv- 
ing as o. — y and ma — > oo. Admittedly, it may be 
computationally prohibitive to make the approxi- 
mation error very small, so we refer to the reported 
coverage as "near- frequentist coverage" around 9*. 
Notwithstanding its approximate nature, this can 
still reveal trends in frequentist coverage across the 
parameter space. 

To apply this to the present example, we extend 
the simulation size to m = 100,000 parameter-data 
ensembles, and set a = 0.01. Various points 9* in the 
parameter space are considered, by fixing ro = 0.10, 
r\ = 0.15, and then setting SN and SP at values 
corresponding to specific prior quantiles. Thus, we 
investigate how the frequentist coverage depends on 
the compatibility between the prior and the true 
SN and SP values. Results appear in Table 6. We 
see under-coverage for SP values which are low in 
relation to the prior, and over-coverage when SP 
or SN is high in relation to the prior. Generally, 
however, the variation in frequentist coverage as SN 
and SP values move around the region supported by 
the prior distribution seems quite modest. 

5. RECOMMENDATIONS 

The above arguments and illustrations are intended 
to summarize and explain in simple form several 
practical recommendations that we and others have 
reached in the course of numerous theoretical stud- 
ies, simulations and real applications. Like others 



before us, we first recommend forming prior dis- 
tributions and then reporting Bayesian interval es- 
timates for parameters of interest, particularly in 
nonidentified model contexts. Based on our investi- 
gations, however, we further suggest that a special 
form of sensitivity analysis be carried out as well. 

Sensitivity analysis is conducted in much applied 
work; typically this involves reporting multiple in- 
ferences corresponding to multiple models and (for 
Bayesians) multiple prior distributions. While these 
analyses are often better than standard reports of 
results from just one model, the resulting collection 
of interval estimates leads to problems of summa- 
rization and interpretation of the collection. Thus, 
we recommend instead that one start with a sin- 
gle, relatively inclusive "covering" prior distribution 
that subsumes the diversity of opinions and possi- 
bilities for the parameters. Then, as a safeguard, we 
would evaluate the labwise coverage of Bayesian in- 
tervals arising from this prior, for a variety of PGDs 
differing somewhat from the prior. If the coverage 
does not fall much below nominal as the PGD de- 
viates from the prior, then we may argue that our 
statistical procedure is probably (in the subjective 
judgmental sense) at least roughly calibrated, in the 
across-study sense of labwise coverage. Otherwise, 
we may consider ourselves alerted to a potentially 
serious miscalibration. 

Table 3, in the context of prevalence surveys with 
nonresponse, provides one example of studying the 
sensitivity of labwise coverage as the PGD deviates 
from the prior distribution. We close with a further 
example from a specific and well-developed scientific 
context. 

5.1 Example: Silica Exposure and Lung Cancer 

We revisit the investigation of Steenland and Green- 
land (2004) on the relationship of silica exposure 



Table 6 

Near- frequentist coverage in the case-control study with misclassification example 







SP* = 0.63 


SP* = 0.77 


SP* = 0.83 


SP* = 0.88 


SP* = 0.95 


SN* 


= 0.63 


90% 


95% 


97% 


97% 


98% 


SN* 


= 0.77 


92% 


98% 


99% 


99% 


99% 


SN* 


= 0.83 


93% 


99% 


99% 


99% 


99% 


SN* 


= 0.88 


92% 


99% 


99% 


99% 


99% 


SN* 


= 0.95 


93% 


98% 


99% 


99% 


98% 



NOTE: Evaluation is for parameter values 8* given by r* = (0.10,0.15) and the indicated values of (SN* , SP*), using a = 0.01 
of the m= 100,000 simulated parameter-data ensembles in each instance. The chosen values for (SN* , SP*) correspond to 
2.5th, 25th, 50th, 75th and 97.5th percentiles of the prior distribution. 
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to lung cancer. In a cohort of 4626 industrial sand 
workers with high silica exposure, 109 lung-cancer 
deaths were observed, compared to an expected count 
of 68.1 under the null hypothesis of no association 
between silica exposure and lung cancer. This com- 
parison of the cohort to US population data is ad- 
justed for age, race, calendar time and sex. It is not 
adjusted for smoking status though, because smok- 
ing histories were not collected for this cohort. 

Steenland and Greenland used prior information 
derived from other studies in order to remedy this 
situation using both Monte Carlo sensitivity analy- 
sis (MCSA) and Bayesian analysis. To describe this 
analysis, let fa be the log relative risk of lung-cancer 
death for silica exposure versus no exposure, within 
strata defined by smoking behavior, and let fa and 
fa be log relative risks for current smokers compared 
to never smokers, and former smokers compared to 
never smokers. Assuming a log-linear model with- 
out products between silica exposure and smoking 
effects, the observed death count can be regarded as 
a Poisson realization with log- mean A, where 

A = c + fa + log(pi + p 2 e P2 + p 3 e ft ) 
-log(q 1 +q 2 e [52 +q 3 e^ 3 ). 

Here c is a known offset obtained from population 
data (c = log 68.1 in the present example), while 
(j)i,P2,P3) and (qi,q2,Q3) are probability distribu- 
tions over (never, current, former) smokers, in the 
exposed and unexposed populations respectively. 
This is a highly nonidentified model, with nine un- 
known parameters involved in the mean function. 
Identification of the target parameter fa can only be 
obtained via a strong assumption, for example, that 
smoking behavior and occupational silica exposure 
are unassociated, that is, (pi,P2,P3) = (li,Q2,<l3), 
which is known to be false. Thus, a far more prin- 
cipled analysis combines the Poisson model for data 
along with prior distributions for {fa, fa, fa), (pi,P2, 
p 3 ) and {qi,q 2 ,q 3 )- 

Based on data from a large cohort study of smok- 
ing and lung cancer, Steenland and Greenland took 
fa ~ iV(log(23.6),0.094 2 ) and independently fa ~ 
A(log(8.7),0.094 2 ). They used smoking data on a 
small sample of 199 workers to inform the prior p ~ 
Dirichlet(199 x (0.26,0.40,0.34)), and used a large 
national survey to inform the prior on q. This survey 
involved 56,000 subjects, but to account for various 
uncertainties it was discounted by a factor of four 
to yield the prior q ~ Dirichlet(14,000 x (0.34,0.35, 



0.31)). Steenland and Greenland used a very diffuse 
prior on fa . This is not appropriate for investigating 
labwise coverage, however, as some data sets simu- 
lated from parameters generated under this prior 
will have implausibly low (i.e., zero) death counts, 
while others will have implausibly large counts. Thus, 
for present purposes we take the prior fa ~ 
N[0, {/n(5)/2} 2 ], which puts most of its weight on 
relative risks between 1/5 and 5. This completes 
specification of the prior distribution. 

Bayesian computation for the present situation is 
readily implemented in a two-stage manner. First, 
an approximate posterior sample is simulated by 
drawing A values "as if" A had a flat prior, and in- 
dependently drawing (fa, fa,pi,p 2 ,P3, ?i, <?2, #3) val- 
ues from their prior distribution. Second, this pos- 
terior sample is "made exact" via importance sam- 
pling, which recognizes the actual prior distribu- 
tion in the (A, fa, fa,pi,P2,P3, Qi, Q2, 93) parameter- 
ization. Note that omitting the second step corre- 
sponds to the MCSA in Steenland and Greenland. 
In the present example this second step has negligi- 
ble impact, though in general importance sampling 
can be used to convert MCSA inferences to fully 
Bayesian inferences in situations where the two do 
not agree so closely. 

Applied to the cohort data, a 95% equal-tailed 
BPCI for exp(fa) is (1.12, 1.73), which is very simi- 
lar to the interval reported by Steenland and Green- 
land using their slightly different prior. For compar- 
ison, the analysis which ignores the confounding ef- 
fect of smoking gives the interval (1.31,1.91). This 
result is based on the same prior for fa as above, 
with the presumption that (pi,P2,P3) = (<7i, 12, 13)- 
Thus, the impact of acknowledging smoking as a 
confounder is to push the interval estimate for fa 
toward (but not across) the null, and to widen the 
interval by about 15%. This widening is somewhat 
modest, since there is relatively good prior data about 
smoking effects and smoking behavior in the two 
populations and the association of smoking with sil- 
ica exposure in these data appears to be small. 

We know that BPCIs based on this prior will have 
correct labwise coverage for a PGD equal to the 
prior. We wish to see how far the coverage devi- 
ates from nominal as the PGD deviates from the 
prior. We thus examine eight PGDs, starting with 
the prior and considering all possible combinations 
of: 

(i) shifting the prior mean for fa left or right by 
one prior standard deviation; 
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Table 7 

Labwise coverage of 95% Bayesian intervals for j3i as the 
PGD varies, in the silica and lung cancer example. The first 

row gives coverage when the PGD equals the prior. 
The remaining eight rows give coverage when the PGD is an 
alteration of the prior. The three- character code describes 
the alteration. The first character (+ or — ) indicates 
whether the mean of 02 is increased or decreased, the second 
character does the same for the mean of j3z, and the third 
character (p or q) indicates whether the prior on p or the 

prior on q is discounted. Results are based on 100,000 
realizations, giving simulation error for coverage less than 
0.1% 



PGD 


Coverage % 


Prior 


94.8 


P 


92.6 


— q 


94.8 


-+P 


93.1 


- + q 


95.2 


+ -P 


92.0 


+ -q 


94.5 


+ +P 


92.4 


+ + q 


94.8 



(ii) shifting the prior mean for left or right by 
one prior standard deviation; 

(hi) discounting the prior on (j>i,p2,P3) by a fac- 
tor of two or (further) discounting the prior on (qi,q2, 
Q3) by a factor of two. 

Table 7 gives coverage results using 95% equal- 
tailed BPIs for (3\ . When the PGD equals the prior, 
the labwise coverage is within simulation error of 
nominal, as theory dictates. As the model is highly 
nonidentified, we are not surprised to see lower than 
nominal coverage for most of the PGDs considered. 
We are pleasantly surprised, however, to see that the 
loss of coverage is very mild. This adds credence to 
the Bayesian results given by Steenland and Green- 
land (2004). 

Based on examples as well as theoretical and sim- 
ulation studies, we recommend that PGD sensitiv- 
ity analysis be used when inference based on non- 
identified models is required. No important sensi- 
tivity was seen in the preceding example. Nonethe- 
less, high sensitivity to plausible PGD specifications 
would have suggested that the full model (including 
those for the prior distribution and data-generating 
mechanism) had inadequately captured posterior un- 
certainty given the actual prior uncertainty of the 
analysts, and that interval estimates from the model 
could be seriously miscalibrated. Hence, as with failed 



regression diagnostics, we would find ourselves ad- 
vised to revise our model rather than rely on it. 

Of course, this advice raises classic issues of the 
impact of post-data model revision based on diag- 
nostics, long recognized as a challenge for applied 
Bayesians as well as for applied frequentists (Box, 
1980). We thus regard these issues as an important 
direction for further research in our proposed ap- 
proach. 
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